# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("darkgrid") #use Seaborn styles
sns.set_context("talk")
This goal of this project is to compare current cancer rates from 2014-2018 with the number and amount of TRI pollutant exceedances from 2007-2011 by county and determine if there is a correlation between the two. Twenty random counties will be chosen from the counties available from the National Institutes of Health (NIH) National Cancer Institute (NCI) State Cancer Profiles. The water violation data will then be obtained for each county or comparison.
The current cancer incidence rates were obtained through a search of all types of cancer from 2014-2108, including all races, both sexes, and all ages. Results were obtained for the United States and organized by county. The table is sorted with highest rates at the top and lowest rates at the bottom.
The Incidence Rate is cases per 100,000 population per year and is age-adjusted to the 2000 US standard population.
Source: Cancer Incidence Rate Report for United States by County
# Load the current cancer rates into a dataframe
currentCancerRates = pd.read_csv("cancerRates_full.csv")
# Display first and last 5 rows
currentCancerRates
| State | FIPS | Met Healthy People Objective of ***? | Age-Adjusted Incidence Rate([rate note]) - cases per 100,000 | Lower 95% Confidence Interval | Upper 95% Confidence Interval | CI*Rank([rank note]) | Lower CI (CI*Rank) | Upper CI (CI*Rank) | Average Annual Count | Recent Trend | Recent 5-Year Trend ([trend note]) in Incidence Rates | Lower 95% Confidence Interval.1 | Upper 95% Confidence Interval.1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US (SEER+NPCR)(1) | 0.0 | *** | 448.6 | 448.3 | 448.9 | N/A | N/A | N/A | 1703249 | falling | -0.9 | -1.1 | -0.7 |
| 1 | Union County, Florida(6) | 12125.0 | *** | 1136.4 | 1066.6 | 1209.9 | N/A | 1 | 1 | 214 | stable | 0 | -1 | 1.1 |
| 2 | Richmond County, Virginia(6) | 51159.0 | *** | 784.2 | 714.8 | 859.3 | N/A | 1 | 3 | 99 | rising | 35.6 | 20 | 53.2 |
| 3 | Lexington City, Virginia(6) | 51678.0 | *** | 727.5 | 633.8 | 832.3 | N/A | 1 | 9 | 52 | rising | 28.1 | 6.9 | 53.5 |
| 4 | Palo Alto County, Iowa(7) | 19147.0 | *** | 661.3 | 594.5 | 733.9 | N/A | 1 | 7 | 83 | stable | 0.8 | 0 | 1.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3173 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3175 | Interpret Rankings provides insight into inter... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3176 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3177 | Data for United States does not include Puerto... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3178 rows × 14 columns
currentCancerRates.dtypes
State object FIPS float64 Met Healthy People Objective of ***? object Age-Adjusted Incidence Rate([rate note]) - cases per 100,000 object Lower 95% Confidence Interval object Upper 95% Confidence Interval object CI*Rank([rank note]) object Lower CI (CI*Rank) object Upper CI (CI*Rank) object Average Annual Count object Recent Trend object Recent 5-Year Trend ([trend note]) in Incidence Rates object Lower 95% Confidence Interval.1 object Upper 95% Confidence Interval.1 object dtype: object
# Remove all rows of extraneous information at the end of the spreadsheet that we don't need for our analyses
currentCancerRates = currentCancerRates.drop(currentCancerRates.index[3142:])
# Since we don't need the US average for our analysis, we will remove it
currentCancerRates = currentCancerRates.drop(0)
# Remove all the columns we don't need
currentCancerRates = currentCancerRates.drop([' FIPS', 'Met Healthy People Objective of ***?', 'Lower 95% Confidence Interval',
'Upper 95% Confidence Interval', 'CI*Rank([rank note])', 'Lower CI (CI*Rank)', 'Upper CI (CI*Rank)',
'Recent Trend', 'Recent 5-Year Trend ([trend note]) in Incidence Rates',
'Lower 95% Confidence Interval.1', 'Upper 95% Confidence Interval.1'], axis=1)
# The county is needed to match the water violation data, so we need to split the information under state
# into a column for County and a column for State
# First rename State to ST
currentCancerRates = currentCancerRates.rename(columns={"State":"ST"})
#Split ST into separate columns of County and State
currentCancerRates[['County', 'State']]=currentCancerRates.ST.str.split(",", expand=True)
# Then remove original messy state information
currentCancerRates = currentCancerRates.drop('ST', axis=1)
# Move County and State in front of incidence rate
firstCol = currentCancerRates.pop('County')
secondCol = currentCancerRates.pop('State')
currentCancerRates.insert(0, 'County', firstCol)
currentCancerRates.insert(1, 'State', secondCol)
# Remove numbers and parentheses from end of State column
currentCancerRates['State']=currentCancerRates['State'].map(lambda x: str(x)[:-3])
#Clean up Incidence Rate Title
currentCancerRates = currentCancerRates.rename(columns={"Age-Adjusted Incidence Rate([rate note]) - cases per 100,000":"Cancer Incidence Rate"})
# Change * for low rates to 0
currentCancerRates['Cancer Incidence Rate']=currentCancerRates['Cancer Incidence Rate'].replace("*",0.0)
# Change Average Annual counts from "3 or fewer" to 0
currentCancerRates['Average Annual Count']=currentCancerRates['Average Annual Count'].replace("3 or fewer",0)
# Convert numeric column data to float types to be used in analyses
currentCancerRates['Cancer Incidence Rate'] = pd.to_numeric(currentCancerRates['Cancer Incidence Rate'])
currentCancerRates['Cancer Incidence Rate'] = pd.to_numeric(currentCancerRates['Cancer Incidence Rate'])
currentCancerRates
| County | State | Cancer Incidence Rate | Average Annual Count | |
|---|---|---|---|---|
| 1 | Union County | Florida | 1136.4 | 214 |
| 2 | Richmond County | Virginia | 784.2 | 99 |
| 3 | Lexington City | Virginia | 727.5 | 52 |
| 4 | Palo Alto County | Iowa | 661.3 | 83 |
| 5 | Powell County | Kentucky | 631.5 | 91 |
| ... | ... | ... | ... | ... |
| 3137 | Woodson County | Kansas | 0.0 | 0 |
| 3138 | Wright County | Minnesota | 0.0 | 0 |
| 3139 | Wyandotte County | Kansas | 0.0 | 0 |
| 3140 | Yakutat City and Borough | Alaska | 0.0 | 0 |
| 3141 | Yellow Medicine County | Minnesota | 0.0 | 0 |
3141 rows × 4 columns
Since we need numeric types for the numeric data, let's start by checking our current data types:
currentCancerRates.dtypes
County object State object Cancer Incidence Rate float64 Average Annual Count object dtype: object
The Average Annual Count needs to be converted to numeric types first using the to_numeric() method
currentCancerRates['Average Annual Count'] = pd.to_numeric(currentCancerRates['Average Annual Count'])
currentCancerRates.dtypes
County object State object Cancer Incidence Rate float64 Average Annual Count int64 dtype: object
The data is arranged by county, but we will be looking at the data at the state level, so we need to aggregate the counts by state
# Aggregate State Average Annual Counts
stateCountAgg = currentCancerRates.groupby('State')['Average Annual Count'].sum()
stateCountAgg.head()
State Alabama 26710 Alaska 2952 Arizona 32347 Arkansas 17339 California 171163 Name: Average Annual Count, dtype: int64
Histogram of State Aggregate Average Annual Count
# Plot bar chart of State Aggregate Counts
# Figure size
fig = plt.figure(figsize=(16,9))
ax = stateCountAgg.plot.bar(x='State', y='Average Annual Count', rot=90)
plt.title("Average Annual Count of Cancer Cases by State");
Unfortunately, this only shows us total counts, without taking into consideration the population of each state. Right now, it looks like California, Florida, New York, and Texas have very high cancer counts. Let's see if that holds true after we account for population.
First, we need to import population data. The 2022 state populations were obtained from World Population Review.
popStates = pd.read_csv("popStates.csv", usecols=["State", "Pop"])
popStates.sort_values(by=['State'], inplace=True)
popStates.head()
| State | Pop | |
|---|---|---|
| 23 | Alabama | 4949697 |
| 48 | Alaska | 720763 |
| 13 | Arizona | 7640796 |
| 33 | Arkansas | 3042017 |
| 0 | California | 39664128 |
Add the cancer rates to the population dataframe:
popStates['Average Annual Count'] = stateCountAgg.values
#Set the cancer rate to cases per 100000
popStates['Cancer Rate'] = popStates['Average Annual Count']/popStates.Pop*100000
# Round the cancer rate to 2 decimal places
popStates =popStates.round(2)
popStates.head()
| State | Pop | Average Annual Count | Cancer Rate | |
|---|---|---|---|---|
| 23 | Alabama | 4949697 | 26710 | 539.63 |
| 48 | Alaska | 720763 | 2952 | 409.57 |
| 13 | Arizona | 7640796 | 32347 | 423.35 |
| 33 | Arkansas | 3042017 | 17339 | 569.98 |
| 0 | California | 39664128 | 171163 | 431.53 |
# Plot bar chart of population-adjusted State Aggregate Counts
# Figure size
plt.rcParams["figure.figsize"]=(16,9)
ax = popStates.plot.bar(x='State', y='Cancer Rate', rot=90);
plt.title("Cancer cases per 10000 people per state");
ax.axhline(popStates['Cancer Rate'].mean(), color='green', linewidth=2)
ax.annotate('Mean Cancer Rate = 503', xy=(1, 503), xytext=(0, 600), size=20, arrowprops=dict(arrowstyle="fancy", fc="black",
connectionstyle="arc3, rad=0.5"));
Now that the cancer counts have been adjusted to population, we have better data that has removed the large variability. The states with the largest cancer rates are now Maine and West Virginia. Cancer rates are lowest in Nevada and Utah. Kansas and Minnesota both have 0, which is most likely due to lack of data as opposed to low cancer rates.
This bar graph is better than the previous one, but the data is still difficult to see and understand. It will be easier to view this data on a heatmap of the US.
# Add state codes to popStates dataframe to align cancer data to states on map
popStates['ST'] = ['AL', 'AK', 'AZ' , 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY',
'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
'OK', 'OR', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
# Build the choropleth
import plotly.express as px
fig = px.choropleth(popStates,
locationmode="USA-states",
locations='ST',
color='Cancer Rate',
color_continuous_scale="YlOrRd",
range_color=(300, 700), #Setting scale to majority of cancer rates. States with zero will still be colored in light yellow.
scope="usa",
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# Improve the legend
fig.update_layout(coloraxis_colorbar=dict(
thicknessmode="pixels", thickness=10,
lenmode="pixels", len=350,
yanchor="top", y=0.8,
ticks="outside",
dtick=50
))
fig.update_layout(title_text=("Population-adjusted Cancer Incidence Rates in the US, cases per 100000"),
title_x=0.5, title_y=.95)
fig.show()
The next step will be to obtain the water pollutant violation data for 20 randomly chosen counties in the US. The water violation data was obtained from the US Environmental Protection Agency through the Enforcement and Compliance History Online tool. TRI and DMR Comparison Dashboard was used to find the total Toxic Release Inventory (TRI) and Discharge Monitoring Reports (DMR) pollutant load in pounds/year. The sum of the total toxic weighted pounds equivalent from both reports was used from 2010, a date that provided more consistent recordings but was far enough back in time to allow the time for the effect of carcinogens to begin to present themselves in the population.
The total weighted pounds equivalent is the mass of a pollutant or chemical discharged that accounts for its relative toxicity.
Example of Search Data:
The cancer rate data contains 3141 counties, so 20 random counties will be chosen between 1 and 3141. A seed will be chosen to make it easier to track all information throughout the analysis.
# Generate 20 random indexes for counties
rng = np.random.RandomState(30)
counties = pd.DataFrame(np.zeros(20, dtype=int), index = rng.randint(1, 3141, size=20))
counties
| 0 | |
|---|---|
| 1830 | 0 |
| 422 | 0 |
| 2990 | 0 |
| 501 | 0 |
| 1165 | 0 |
| 920 | 0 |
| 899 | 0 |
| 1846 | 0 |
| 764 | 0 |
| 1170 | 0 |
| 1156 | 0 |
| 2474 | 0 |
| 264 | 0 |
| 1912 | 0 |
| 690 | 0 |
| 1966 | 0 |
| 164 | 0 |
| 2579 | 0 |
| 1043 | 0 |
| 3061 | 0 |
# Merge county name and cancer incidence rate for selected indexes
counties = counties.merge(currentCancerRates.County, left_index=True, right_index=True)
counties = counties.merge(currentCancerRates['Cancer Incidence Rate'], left_index=True, right_index=True)
# remove '0' column
counties.drop(labels=0, axis=1, inplace=True)
counties.head()
| County | Cancer Incidence Rate | |
|---|---|---|
| 1830 | Tuscaloosa County | 441.4 |
| 422 | Saline County | 506.5 |
| 2990 | Faribault County | 0.0 |
| 501 | Putnam County | 501.4 |
| 1165 | Bristol County | 470.4 |
currentCancerRates.loc[1830]
County Tuscaloosa County State Alabama Cancer Incidence Rate 441.4 Average Annual Count 926 Name: 1830, dtype: object
# Load the current cancer rates into a dataframe
waterViolations = pd.read_csv("waterViolations.csv")
waterViolations
| County | TWPE Released | Total Pounds Released | State | |
|---|---|---|---|---|
| 0 | Tuscaloosa County | 38726 | 87340975 | Alabama |
| 1 | Saline County | 1653 | 5870000 | Nebraska |
| 2 | Faribault County | 102 | 9357619 | Minnesota |
| 3 | Putnam County | 72 | 151193 | Indiana |
| 4 | Bristol County | 650 | 2040000 | Rhode Island |
| 5 | Union County | 2 | 1080 | Iowa |
| 6 | Keokuk County | 0 | 88296 | Iowa |
| 7 | Mobile County | 117712 | 379029517 | Alabama |
| 8 | Windham County | 731 | 720718 | Connecticut |
| 9 | Carbon County | 217 | 29761 | Pennsylvania |
| 10 | Huron County | 152 | 10330001 | Ohio |
| 11 | Iron County | 89 | 65627 | Michigan |
| 12 | Franklin County | 219 | 47146 | Mississippi |
| 13 | Stanley County | 24 | 79825 | South Dakota |
| 14 | Madison County | 1390 | 473155 | New York |
| 15 | Huntingdon County | 1188 | 7470000 | Pennsylvania |
| 16 | Chenango County | 89 | 1577131 | New York |
| 17 | Karnes County | 6089 | 10510000 | Texas |
| 18 | Knox County | 66 | 460555 | Illinois |
| 19 | Mower County | 868 | 16820000 | Minnesota |
# Add pollution data to counties dataframe with cancer rates
# Align indexes so data has correct overlap
counties.index = waterViolations.index
counties['TWPE Released'] = waterViolations['TWPE Released']
counties['Total Pounds Released'] = waterViolations['Total Pounds Released']
counties
| County | Cancer Incidence Rate | TWPE Released | Total Pounds Released | |
|---|---|---|---|---|
| 0 | Tuscaloosa County | 441.4 | 38726 | 87340975 |
| 1 | Saline County | 506.5 | 1653 | 5870000 |
| 2 | Faribault County | 0.0 | 102 | 9357619 |
| 3 | Putnam County | 501.4 | 72 | 151193 |
| 4 | Bristol County | 470.4 | 650 | 2040000 |
| 5 | Union County | 480.2 | 2 | 1080 |
| 6 | Keokuk County | 481.2 | 0 | 88296 |
| 7 | Mobile County | 440.6 | 117712 | 379029517 |
| 8 | Windham County | 488.0 | 731 | 720718 |
| 9 | Carbon County | 470.2 | 217 | 29761 |
| 10 | Huron County | 470.8 | 152 | 10330001 |
| 11 | Iron County | 397.9 | 89 | 65627 |
| 12 | Franklin County | 521.7 | 219 | 47146 |
| 13 | Stanley County | 436.8 | 24 | 79825 |
| 14 | Madison County | 491.4 | 1390 | 473155 |
| 15 | Huntingdon County | 433.8 | 1188 | 7470000 |
| 16 | Chenango County | 534.8 | 89 | 1577131 |
| 17 | Karnes County | 386.4 | 6089 | 10510000 |
| 18 | Knox County | 475.8 | 66 | 460555 |
| 19 | Mower County | 0.0 | 868 | 16820000 |
TWPE = sns.boxplot(x=counties['TWPE Released']).set(title="Boxplot of TWPE Released Data");
Total = sns.boxplot(x=counties['Total Pounds Released']).set(title="Boxplot of Total Pounds Released Data")
There are 3 obvious outliers in each of the boxplots, so we will remove those: Tuscaloosa, Mobile, and Karnes.
#Drop Tuscaloosa, Mobile, and Karnes data
counties.drop(labels=0, axis=0, inplace=True)
counties.drop(labels=7, axis=0, inplace=True)
counties.drop(labels=17, axis=0, inplace=True)
counties
| County | Cancer Incidence Rate | TWPE Released | Total Pounds Released | |
|---|---|---|---|---|
| 1 | Saline County | 506.5 | 1653 | 5870000 |
| 2 | Faribault County | 0.0 | 102 | 9357619 |
| 3 | Putnam County | 501.4 | 72 | 151193 |
| 4 | Bristol County | 470.4 | 650 | 2040000 |
| 5 | Union County | 480.2 | 2 | 1080 |
| 6 | Keokuk County | 481.2 | 0 | 88296 |
| 8 | Windham County | 488.0 | 731 | 720718 |
| 9 | Carbon County | 470.2 | 217 | 29761 |
| 10 | Huron County | 470.8 | 152 | 10330001 |
| 11 | Iron County | 397.9 | 89 | 65627 |
| 12 | Franklin County | 521.7 | 219 | 47146 |
| 13 | Stanley County | 436.8 | 24 | 79825 |
| 14 | Madison County | 491.4 | 1390 | 473155 |
| 15 | Huntingdon County | 433.8 | 1188 | 7470000 |
| 16 | Chenango County | 534.8 | 89 | 1577131 |
| 18 | Knox County | 475.8 | 66 | 460555 |
| 19 | Mower County | 0.0 | 868 | 16820000 |
#Rerun boxplots
TWPE = sns.boxplot(x=counties['TWPE Released']).set(title="Boxplot of TWPE Released Data");
Total = sns.boxplot(x=counties['Total Pounds Released']).set(title="Boxplot of Total Pounds Released Data")
There still appears to be an outlier in the Total Pounds Released Data, but we're going to move ahead to scatterplots and correlations.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(counties['TWPE Released'], counties['Cancer Incidence Rate'])
ax.set_xlabel('TWPE Released')
ax.set_ylabel('Cancer Incidence Rate')
plt.show()
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(counties['Total Pounds Released'], counties['Cancer Incidence Rate'])
ax.set_xlabel('Total Pounds Released')
ax.set_ylabel('Cancer Incidence Rate')
plt.show()
It is possible that the counties with no cancer rates simply did not submit data, so we will remove those data points.
#Remove 0 cancer rates
counties.drop(labels=2, axis=0, inplace=True)
counties.drop(labels=19, axis=0, inplace=True)
counties
| County | Cancer Incidence Rate | TWPE Released | Total Pounds Released | |
|---|---|---|---|---|
| 1 | Saline County | 506.5 | 1653 | 5870000 |
| 3 | Putnam County | 501.4 | 72 | 151193 |
| 4 | Bristol County | 470.4 | 650 | 2040000 |
| 5 | Union County | 480.2 | 2 | 1080 |
| 6 | Keokuk County | 481.2 | 0 | 88296 |
| 8 | Windham County | 488.0 | 731 | 720718 |
| 9 | Carbon County | 470.2 | 217 | 29761 |
| 10 | Huron County | 470.8 | 152 | 10330001 |
| 11 | Iron County | 397.9 | 89 | 65627 |
| 12 | Franklin County | 521.7 | 219 | 47146 |
| 13 | Stanley County | 436.8 | 24 | 79825 |
| 14 | Madison County | 491.4 | 1390 | 473155 |
| 15 | Huntingdon County | 433.8 | 1188 | 7470000 |
| 16 | Chenango County | 534.8 | 89 | 1577131 |
| 18 | Knox County | 475.8 | 66 | 460555 |
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(counties['TWPE Released'], counties['Cancer Incidence Rate'])
ax.set_xlabel('TWPE Released')
ax.set_ylabel('Cancer Incidence Rate')
plt.show()
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(counties['Total Pounds Released'], counties['Cancer Incidence Rate'])
ax.set_xlabel('Total Pounds Released')
ax.set_ylabel('Cancer Incidence Rate')
plt.show()
plt.figure(figsize=(10,6))
c = counties.corr()
sns.heatmap(c, annot=True, vmin=-1.0, vmax=1.0)
c
| Cancer Incidence Rate | TWPE Released | Total Pounds Released | |
|---|---|---|---|
| Cancer Incidence Rate | 1.000000 | 0.091514 | -0.085780 |
| TWPE Released | 0.091514 | 1.000000 | 0.398283 |
| Total Pounds Released | -0.085780 | 0.398283 | 1.000000 |
It's obvious from the correlation data that there is absolutely no correlation between water pollution loads released into water systems and the incidence of cancer later in life with the current data. There are many reasons why the data are not currently showing a correlation. One possibility is invalid data, missing or incorrect data. It is unclear how reliable this data is. In addition, all pollutants were included, so extraneous information may be skewing results. Last, there are many, many factors beyond water contaminants that can lead to cancer, so trying to remove those possibilities from the data is impossible.
An interesting study would look at each particular chemical and see if there is any correlation between it and cancers that are linked to it.
I would assign a grade of A to this project. I spent a significant amount of time working on it (approximately 3 full days of work). In addition, I challenged myself by finding my own data sources and combining them into an analysis. I also challenged myself by learning how to do choropleth map of cancer rates. Last, I have included all the required elements, including NumPy, Pandas, Seaborn, and Matplotlib, used in various ways.
I found this assignment to be the most challenging of all the assignments. So much interesting data exists, and I spent a large amount of time considering various datasets. It took physical effort to stop myself and choose a project. But the most challenging aspects involved using my own data and discovering issues that can arise when creating your own projects from scratch. For example, I discovered problems due to datatypes that I hadn't anticipated. When I tried to run the sum aggregation, it didn't work. It took a lot of Google fu to realize that I was trying to aggregate an object type and needed to covert it to a numeric type. Another example involved trying to merge data that had different indexes. I kept getting NaN values after the merge, and it took another long search to find the answer was aligning the indexes.
Even though it was incredibly challenging, I feel I have learned an incredible amount this week. The most important thing I learned is that I still have a lot more learning to do when it comes to everything we covered this semester.